feat: support planning cleanup#7147
Open
yanghua wants to merge 5 commits into
Open
Conversation
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Collaborator
Author
|
@claude review |
Xuanwo
reviewed
Jun 8, 2026
Xuanwo
left a comment
Collaborator
There was a problem hiding this comment.
I think this should follow SQL EXPLAIN semantics: the cleanup plan should be a dry-run/audit report, not a materialized deletion plan.
Execution should re-evaluate cleanup from the current dataset/ref state instead of trusting an old file list. For example, a tag or branch can be added after planning without advancing the manifest version, so the current read_version check can still pass while the old plan deletes files that are now protected.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Background / Motivation
cleanup_old_versionstoday behaves as a black box: callers hand in a policyand either get a
RemovalStatsback or, on failure, a partially mutateddataset with no record of which files were inspected, kept, or deleted.
That opacity becomes painful in three scenarios we hit in production:
upcoming cleanup will remove (and how many bytes that frees) before
actually running it, especially on tables with 100k+ fragments where a
mistaken policy could remove tens of GB.
(commit hooks, schedulers), there is no artifact we can inspect afterwards
to answer "why did this file go away?". The
tracingaudit log helps, butonly if you were already capturing it.
execute on another (or in a maintenance window), which the current API
does not support at all.
This PR splits cleanup into an explicit plan and execute pair, while
keeping the existing
cleanup_old_versionsentry point byte-for-bytecompatible. The plan is a serializable description of every file the cleanup
intends to delete, the reason it qualifies, and the dataset snapshot it was
built from.
What's in this PR
plan_cleanup(&Dataset, CleanupPolicy) -> CleanupPlancleanup_with_plan(&Dataset, CleanupPlan) -> RemovalStatsCleanupPlan/CleanupFile/CleanupFileKind/CleanupFileReason/CleanupPlanStats/CleanupReferencedBranchdata types.CleanupTaskinto three explicit execution paths(
cleanup_old_versions/cleanup_with_plan/ commit hooks), each withits own trust model documented in-source.
cleanup_with_planvalidates: dataset URI, base path, that every plannedpath stays under the dataset base, and that the plan's
read_versionstill matches the storage-resolved latest version. A residual TOCTOU
window between the version check and the deletes is documented in the
rustdoc; callers running concurrent writers must serialize externally.
the in-memory dataset handle, so plans built from a stale handle are
still safe.
list_manifest_locationsdid not return the storage-resolved latest version (defends against
eventual-consistency or racing list output).
Behavior changes worth flagging
RemovalStatsreturned bycleanup_old_versionsnow includes the statsof cascaded
clean_referenced_branchescleanups. Previously those weresilently dropped. Monitoring/dashboards that compared against the old
numbers will see an increase.
cleanup_with_planwill reject a plan if any commit lands betweenplan_cleanupandcleanup_with_planon the same dataset. This is bydesign — see rustdoc. The internal
cleanup_old_versionspath isunaffected.
Tests
plan_cleanup_does_not_delete_filesplan_cleanup_uses_latest_version_with_stale_handlecleanup_with_plan_rejects_stale_versioncleanup_with_plan_rejects_toctou_commit_with_stale_handleinternal_cleanup_plan_allows_toctou_commit_before_deleteprocess_manifests_rejects_listing_missing_latest_versioncleanup_old_versions/cleanup_with_policytests continueto pass unmodified.